1.  INTRODUCTION

Gains in computer performance can be achieved through improvements at
the circuit level (e.g., faster circuitry), the basic building block
level (e.g., more powerful microprocessors), the building block
interconnection level (e.g., better computer system architecture), and
the system software level (e.g., more effective system software). Many
of these points are studied in [1]. However, there are strong
dependencies between the levels (see Figure 1), and full system
utilization of improvements at one level will usually require
associated modifications at the other levels. The absence of both
necessary system interconnection signals and important system software
instructions in modern building blocks is a typical example of a
situation where it is appropriate to make adjustments across level
boundaries. Some of the dependencies and their associated
modifications will be illustrated in this paper.

Figure 1
The four basic levels of potential computer system performance
improvement: circuit level, building block level, building block
interconnection level, and system software level.

Figure 4
Average bus utilization and average bus response time as a function of
the number of processors for a pended-transaction-based MMPS. The graph
also includes the average bus utilization for a conventional
single-bus-based MMPS. [15]


2.4  Memory and Processor Contentions

In a single-bus-based MMPS, any number of processors may simultaneously
need information from the same memory module. In addition, there is
likely to be an exponential-type distribution of the memory load [15];
i.e., certain areas of the memory will be in greater demand than other
areas (see Figure 2). Since a memory module can service only one
request at a time, this situation may result in severe contention among
the processors that are requesting the highly demanded memory areas.
Similarly, a set of memory modules may attempt to respond
simultaneously to one particular processor. This creates a processor
contention. The memory and processor contentions, which are often
referred to as the "device busy" problem, may degrade the MMPS
performance. This performance degradation will arise from processors
awaiting crucial information from the highly demanded memory modules,
and from the extra bus load imposed by multiple requests for
transmission to busy devices. Thus, the device busy problem must be
dealt with in a single-bus-based MMPS design.


The memory contention can be dealt with in two different ways.
Danielsson and his colleagues [7] have suggested that the memory space
should be divided into small modules. This will allow the memory
requests to be spread out over a large number of independent modules,
thus reducing the probability of getting simultaneous requests for the
same module. Rather than using small memory modules, the problem can be
addressed by using faster memory circuitry in the highly demanded
areas. This concept must be accompanied by a memory content migration
scheme based on continuously collected memory traffic statistics. The
idea of using memory modules with different speed characteristics is
analogous to the well-known cache concept; however, the shared-bus
architecture is more flexible than a standard cache. Toong [15] has
studied the memory speed part of this solution (assuming a stationary
memory content), using an analytic model, and his results show
promising effects.
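
The effect of dividing the memory into many small modules can be
illustrated with a short calculation. The sketch below is not the model
used in [7] or [15]; it assumes that simultaneous requests are
independent and uniformly distributed over the modules (the text above
notes that real memory load is more likely exponential-type, so actual
contention would be higher). The function name collision_probability
is chosen here for illustration only.

```python
def collision_probability(requests: int, modules: int) -> float:
    """Probability that at least two of `requests` simultaneous requests
    address the same module, assuming independent, uniform addressing."""
    if requests > modules:
        return 1.0  # pigeonhole: some module must receive two requests
    p_all_distinct = 1.0
    for i in range(requests):
        p_all_distinct *= (modules - i) / modules
    return 1.0 - p_all_distinct


# Four simultaneous requests spread over an increasing number of modules.
for m in (4, 16, 64, 256):
    print(m, round(collision_probability(4, m), 3))
```

Under these assumptions the collision probability falls steadily as the
module count grows, which is the qualitative argument made for small
modules.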

No  practical  implementation  of  the 
above  suggested  solutions  to  the  memory 
contention  will  eliminate  the  entire 
problem.  The  processors  in  the  system  will 
therefore  continue  to  become  unproductive 
when  they  must  wait  for  crucial  information 
from  the  memory.  Note  that  only  a  portion 
of  the  delayed  information  will  influence 
the    processor    productivity. 

During the unproductive periods, the processors may waste bus bandwidth
through repeated requests for the needed information. The waste of bus
bandwidth can be reduced significantly by using input and output queues
on all of the devices that are connected to the bus. This will allow
all of the devices to transmit information on the bus even if the
receiver is busy; i.e., the information will be stored in the input
queue until it can be processed. It will also permit the devices to
keep on working even if they cannot get immediate access to the bus;
i.e., the output information will be deposited in the output queue
until it can be transmitted. In normal operation, the required size of
the queues, which is a system design parameter, is not likely to go
beyond practical limits. According to Toong [13], who has studied both
the memory speed and the queue solutions to the device busy problem, a
64-level queue will result in a queue overflow probability on the order
of 10**-12, which is zero for all practical purposes.
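
To give a feeling for the order of magnitude quoted above, the
following sketch uses a textbook M/M/1 queue, for which the probability
of finding at least K entries is rho**K at utilization rho. This is an
assumption made here for illustration; it is not the analytic model of
[13], and the function names are hypothetical.

```python
def overflow_tail(rho: float, depth: int) -> float:
    """P(queue length >= depth) for an M/M/1 queue with utilization rho."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    return rho ** depth


def utilization_for_target(depth: int, target: float) -> float:
    """Utilization at which the M/M/1 tail probability equals `target`."""
    return target ** (1.0 / depth)


# Tail probability of a 64-entry queue at 65% utilization.
print(overflow_tail(0.65, 64))
# Utilization at which a 64-entry queue reaches a 10**-12 tail.
print(utilization_for_target(64, 1e-12))
```

In this simplified model a 64-level queue reaches a tail probability
near 10**-12 at a device utilization of roughly 65 percent, so the
quoted figure is plausible for a moderately loaded device.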

Design of a reliable single-bus-based MMPS, which would utilize queues
of insufficient length to reduce the device busy problem, must
incorporate mechanisms that will prevent queue overflow. A "queue-full"
signal can be used to solve this problem. To avoid wasting bus
bandwidth on queue-full conditions, it would be necessary to
incorporate all the


The physical characteristics of the bus lines (i.e., speed, length,
etc.) will definitely influence the bus performance. However, the
effect of this matter is significant only for long buses (length >
1 ft.) and/or for critical speed requirements. For the purposes of this
paper, the bus lines will be considered to be short enough and well
conditioned so that they do not impose any significant data transfer
constraints. The only remaining physical speed constraint, then, is the
speed of the interface circuitry. This can be solved, for all practical
purposes, by using high-speed, uniform logic at the bus interface (see
Figure 3 for illustration).

The only remaining factor to consider, then, is the protocol used on
the bus as a limit to data transfer rates. The high bus utilization of
the above-mentioned processors is primarily a function of the
master-slave based bus protocols that are used. Normally, the protocols
are fixed at the basic building block design stage, and cannot be
changed after the design is completed. Recent micro-coded processors,
though, present the potential of being modified to optimize the bus
protocol without changing the basic processor. In general, the basic
building block designers may use any bus protocol. The actual choice is
always the result of a trade-off process, which, until recently, due to
low-density devices, could not favor sophisticated bus protocols.
Today, however, with VLSI technology at hand, it should be possible to
implement low bus utilization protocols without any major penalties on
the other system parameters.

Different versions of a special "split transaction" bus protocol have
been proposed by various authors as a general solution to the high bus
utilization problem [3,6,7]. This type of protocol, which is
illustrated in Figure 3, splits the regular master-slave based
transaction (Tstd) into two subtransactions (T1 and T2). These can take
place disconnected in time as a transaction initiation part and a
transaction completion part. Consequently, the bus will be free for
other usages during the asynchronous wait interval (Mt). Obviously,
this implies that a read transaction will utilize both of the
subtransactions, whereas a write operation will utilize only the first
subtransaction (T1).
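
The difference in bus occupancy between the two protocols can be
sketched as follows. The timing values are hypothetical and are chosen
here only to make the comparison concrete; the function names are
likewise illustrative and do not come from the cited work.

```python
def bus_hold_master_slave(t1: float, wait: float, t2: float) -> float:
    """Bus time for one master/slave transaction: the bus stays allocated
    during the slave's wait interval."""
    return t1 + wait + t2


def bus_hold_split(t1: float, t2: float, is_read: bool) -> float:
    """Bus time for one split transaction: the bus is held only during
    the subtransactions, and a write needs only the first one."""
    return t1 + t2 if is_read else t1


# Hypothetical timings in nanoseconds (assumptions, not measured values).
T1, WAIT, T2 = 50.0, 400.0, 50.0
print("master/slave read:", bus_hold_master_slave(T1, WAIT, T2))
print("split read:       ", bus_hold_split(T1, T2, is_read=True))
print("split write:      ", bus_hold_split(T1, T2, is_read=False))
```

With these example numbers a read occupies the bus for one fifth of the
master/slave time, because the wait interval is returned to other
devices.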

The actual bus protocol implementation varies in two major areas.
These are:

-centralized versus decentralized control, and
-synchronous versus asynchronous logic.

The trade-off process between centralized and decentralized control
involves topics of reliability (fail-soft), modularity, and cost. The
choice between synchronous and asynchronous logic is more likely to
involve performance trade-offs. However, the differences between the
various implementations have no major impact on performance, and a
further discussion of these topics is not pertinent here.

Figure 3
The Split Transaction Bus protocol concept: (a) the Pended Transaction
Bus protocol with I/O queues and fast interface circuitry, (b) the
common Master/Slave transaction, and (c) the Pended Transaction
concept. [15]


2.3  The  Pended  Transaction  Bus  Protocol 

A split transaction bus protocol, which is called the "Pended
Transaction Bus Protocol" (PTBP), is presently under investigation at
M.I.T. This bus utilizes asynchronous logic and assumes infinite input
and output queues on all of the devices that are using the bus. An
analytic model [13] has been developed to study the average bus
response time and the average queue length as a function of the number
of processors (Pn). Figure 4 shows the average bus utilization and the
average bus response time as a function of Pn. These results indicate
that 50 processors can be connected to the single bus without any
severe contention. For comparison, the figure also shows the equivalent
bus utilization for a conventional master/slave based MMPS.
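
The general shape of such curves can be reproduced with a very simple
queueing sketch. The code below treats the bus as a single M/M/1 server
with infinite queues; this is an assumption made here for illustration
and is not the analytic model of [13]. The request rate and bus hold
time are hypothetical values.

```python
def bus_utilization(n_processors: int, request_rate: float,
                    hold_time: float) -> float:
    """Offered bus load when each processor issues `request_rate`
    requests per nanosecond and each request holds the bus `hold_time` ns."""
    return n_processors * request_rate * hold_time


def mean_response_time(n_processors: int, request_rate: float,
                       hold_time: float) -> float:
    """Mean time in the bus system (queueing plus transfer) for M/M/1."""
    rho = bus_utilization(n_processors, request_rate, hold_time)
    if rho >= 1.0:
        return float("inf")  # the bus is saturated
    return hold_time / (1.0 - rho)


# Hypothetical: one request per microsecond per processor, 15 ns bus hold.
for pn in (10, 25, 50, 60):
    print(pn,
          round(bus_utilization(pn, 0.001, 15.0), 3),
          round(mean_response_time(pn, 0.001, 15.0), 1))
```

Utilization grows linearly with Pn while the response time rises
sharply only as the bus approaches saturation, which is why a short bus
hold time per transaction is the decisive parameter.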

A fully interlocked asynchronous version of the PTBP without queues has
been implemented at MIT [3]. Danielsson et al. [7] and Lavington et al.
[6] have both developed a synchronous split transaction based bus.
However, none of these buses has been implemented with queues. The
importance of queues will be illustrated later in this paper. Bus cycle
times between 60 ns and 100 ns have been achieved for the different
buses. However, if these buses were implemented using today's
technology, they could all run at cycle times of less than 30 ns.


Performance has always been considered the only major drawback with the
single bus. Interconnection techniques to overcome the performance
problem have usually added to the system complexity, and thus corrupted
the three aforementioned advantages. Two one-way paths and multiple
two-way paths are among the most frequently suggested modifications in
terms of performance gain. Note that these alternatives represent a
bypass of the actual problem instead of a solution to it.

Danielsson et al. [7] have shown that the single bus represents only a
minor performance degradation, when compared to a multiport system, as
long as the bus is much faster (by a factor of 10x) than the devices
that are using the bus. In other words, a low bus utilization at the
building block level is necessary for successful operation of a
single-bus-based MMPS. This fact is also expressed by Haagens [3]. The
bus, as used herein, consists of the actual bus lines, the immediate
bus line interface circuitry, and the bus protocol.

The above results imply that multiport based systems are probably the
best alternative in applications where ultra-high bandwidths are
required and where price is only a minor concern. On the other hand,
however, the results also indicate that, given the necessary low bus
utilizations at the building block level, the single bus will probably
become the superior overall interconnection mechanism for a major
portion of the MMPS application spectrum.


2.2  The Bus Utilization Problem

Given the above promising prospects regarding the single time-shared
bus as an MMPS interconnection mechanism, the fundamental problem now
becomes that of designing a new bus which can support such an MMPS
architecture. Essentially, the single bus operates in a two-step cycle,
as follows:

-devices with an active bus need compete for the bus in an arbitration
 process before the selected device is connected to the vacant bus, and
-information is transferred on the bus, which is released when the
 transfer is finished.


as  fast  as  the  minimum  bus  information 
transfer  process.  On  the  other  hand,  it 
should  be  noted  that  a  slow, 
non-overlapping  bus  arbitration  process 
will  waste  a  good  portion  of  the  available 
bus  bandwidth. 
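
The argument about overlapped and non-overlapped arbitration can be
made concrete with a short calculation. This is a minimal steady-state
sketch under the assumption that the bus is always in demand; the
function names and the example timings are hypothetical.

```python
def bus_cycle_time(t_arb: float, t_xfer: float, overlapped: bool) -> float:
    """Time consumed per transaction. With overlap, the next master is
    chosen during the current transfer, so the slower step dominates."""
    return max(t_arb, t_xfer) if overlapped else t_arb + t_xfer


def useful_fraction(t_arb: float, t_xfer: float, overlapped: bool) -> float:
    """Fraction of bus time spent transferring information."""
    return t_xfer / bus_cycle_time(t_arb, t_xfer, overlapped)


# Hypothetical timings in nanoseconds.
print(useful_fraction(50.0, 100.0, overlapped=True))    # arbitration hidden
print(useful_fraction(50.0, 100.0, overlapped=False))   # arbitration exposed
print(useful_fraction(150.0, 100.0, overlapped=True))   # arbitration too slow
```

Overlapped arbitration costs nothing as long as it is at least as fast
as the transfer; otherwise, or when it is not overlapped, part of the
bus bandwidth is lost to the allocation step.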

The  actual  bus  arbitration  process, 
which  involves  decision  algorithms  based  on 
the  system  priority  structure,  will  not  be 
discussed  in  this  paper.  Most  of  the 
problems  related  to  the  arbitration  process 
are  described  in  the  literature  by  various 
authors  [3,4,8,9,10,11].  In  the  remainder 
of  this  paper,  the  arbitration  and  the 
information  transfer  processes  will  be 
assumed  to  be  overlapping.  In  addition,  it
is  also  assumed  that  the  arbitration 
process  is  at  least  as  fast  as  the  minimum 
information  transfer  process.  The  bus 
utilization  thus  becomes  solely  a  function 
of  the  information  transfer  process  in  the 
remainder  of  this  paper. 

Previous generations of microprocessors (e.g., Intel 8080/85, Motorola
M6800, and Zilog Z80) have very high bus utilization [12]. A set of bus
utilization values for different processors is given in Table 1. As a
result of the high bus utilization, these processors cannot
successfully work together on a single bus without major modifications
(i.e., between 1.5 and 2.5 non-modified processors will consume the
entire bus capacity).
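
The relation between per-processor bus utilization and the number of
processors that saturate the bus can be written down directly. The
sketch assumes that bus loads simply add and that saturation occurs at
a total utilization of 1.0; the utilization values used are
illustrative and are not taken from Table 1.

```python
def processors_to_saturate(utilization: float) -> float:
    """Number of identical processors whose combined load fills the bus."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


# Illustrative per-processor bus utilizations.
for u in (0.40, 0.50, 2.0 / 3.0):
    print(round(u, 3), round(processors_to_saturate(u), 2))
```

Under this assumption, the range of 1.5 to 2.5 processors corresponds
to per-processor utilizations between roughly 67 and 40 percent.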
Depending  on  the  actual
implementation,  the  two  steps  may  or  may 
not  be  "overlapped"  in  time  (i.e.,  the 
"next  bus  master"  may  or  may  not  be 
selected  in  parallel  with  the  "current  bus 
usage").  From  a  bus  throughput  point  of 
view,  the  first  alternative  is  obviously 
the  best.  In  fact,  given  this  alternative, 
the  bus  allocation  mechanism  will  not 
influence  the  bus  throughput  at  all,  if  the 
entire  bus  arbitration  process  is  at   least 


*   8080 uses bus during T1 (SYNCH) time.
**  8080 does not use bus during T1 (SYNCH) time.

Table  1 
Processor  bus  utilizations  for   various 
benchmark  programs  [12]. 


queue status and device-ID signals into the arbitration process.
However, this will increase the arbitration complexity, and the
cost/performance potential of the single-bus-based MMPS will be
severely degraded. Thus, it is not worth the effort to try to eliminate
all of the additional bus utilization (especially since the potential
gain is most likely insignificant in any reasonable size queue system).
The problems associated with passing critical real-time parameters
through the queues, which will impose an unpredictable delay in the
response, must also be taken care of in the MMPS design. Both the
"repeated request" problem and the problems associated with real-time
parameters must be solved on an individual system basis.

3.  A  MODERN  MICROPROCESSOR  ARCHITECTURE. 

It  will  be  useful  at  this  point  to 
relate  some  of  the  previous  ideas  to  a 
current  microprocessor  architecture  and  a 
vendor  supplied  MMPS  interconnection 
scheme.  The  Motorola  MC68000  will  be  used 
as  a  representative  of  the  current 
microprocessor  technology  available  for 
general  use.  The  MC68000  is  chosen 
primarily  because  of  its  microcoded  nature 
and  associated  potential  of  being  modified. 
Note  that  there  is  no  major  difference 
between  the  MC68000  and  the  other 
alternative  16  bit  processors  (i.e.,  Intel 
8086  and  Zilog  Z8000)  in  terms  of  a  single 
bus  based  MMPS. 

Let  us  first  review  the  architectural 
characteristics  of  the  MC68000.  At  the  chip 
level,  this  newest  generation  of 
microprocessors  presents  significant 
improvements  over  previous  architectures. 
In  particular,  data  paths  to  and  from 
memory  have  been  expanded  to  16  bits,  and 
the  address  ranges  have  been  extended  to 
allow  larger  memory  systems  to  be  directly 
accessed.  All  address  calculations  in  the 
MC68000  are  performed  in  32  bits.  However, 
at  the  present  time  only  24  bits  are 
delivered outside the chip. The address is
delivered in 4 separate spaces, thus
yielding 64 megabyte addressability. All 32
bits could easily become available (given
more pinout), and future versions of the
processor  will  definitely  be  implemented 
with  the  full  32  bit  address  space.  The 
MC68000  uses  memory  mapped  I/O.  This  allows 
all  memory  referencing  instructions  to  be 
used  for  I/O  referencing  as  well,  thus 
saving  instructions  at  the  expense  of  I/O 
protection  at  the  instruction  level.  As  a 
result,  the  I/O  protection  must  be 
performed  at  the  memory  protection  level. 
The  large  memory  space  virtually  requires 
some  form  of  management.  The  MC68000  uses 
an  external  memory  management  chip.  This 
allows  increased  function  by  increasing 
silicon  area.  The  management  unit  can 
relocate,  check  bounds,  and  function  check 
all  references  to  support  very 
sophisticated  memory  mapping  and  protection 
facilities. 
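
The addressability figures quoted above follow from simple arithmetic,
which the sketch below reproduces. The function name is chosen here for
illustration.

```python
MEGABYTE = 1 << 20


def addressable_bytes(address_bits: int, spaces: int = 1) -> int:
    """Bytes reachable with `address_bits` address lines in each of
    `spaces` separate address spaces."""
    return spaces * (1 << address_bits)


print(addressable_bytes(24) // MEGABYTE)      # one space, 24 address bits
print(addressable_bytes(24, 4) // MEGABYTE)   # four separate spaces
print(addressable_bytes(32) // MEGABYTE)      # full 32-bit address
```

Twenty-four address bits reach 16 megabytes per space, so four separate
spaces give the 64 megabytes mentioned in the text, while a full 32-bit
address would reach 4096 megabytes in a single space.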
Most  of  the  above  processor  features
would  obviously  be  very  useful  in  a 
multiple  microprocessor  environment. 
However,  the  problem  of  interconnection 
remains  to  be  solved  before  the  features 
will  have  any  value  in  the  area  of  MMPS. 


3.1  Microprocessor  Timing 

The MC68000 bus utilization depends on the instruction and data
fetching process. Actual bus loading numbers are not available at the
present time. However, adequate bus utilization estimates can be made.
The MC68000 has three basic bus operations (each state is one half
clock cycle, or 62.5 ns at 8 MHz). These are:

-Read: 8 bus states plus wait state pairs as required
-Write: 10 bus states plus wait state pairs as required
-Read-Modify-Write: 13 bus states plus wait state pairs as required

A detailed description of the basic bus operations along with the
timing diagrams is given in [14].
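
The state counts listed above translate into bus operation times as
follows. The sketch uses only the figures given in the text (state
counts and one state per half clock cycle); the names BUS_STATES and
bus_op_ns are introduced here for illustration.

```python
# Bus states per operation, as listed in the text.
BUS_STATES = {"read": 8, "write": 10, "read-modify-write": 13}


def state_ns(clock_mhz: float) -> float:
    """Duration of one bus state (half a clock cycle) in nanoseconds."""
    return 1000.0 / clock_mhz / 2.0


def bus_op_ns(states: int, wait_pairs: int = 0, clock_mhz: float = 8.0) -> float:
    """Duration of a bus operation; each wait state pair adds two states."""
    return (states + 2 * wait_pairs) * state_ns(clock_mhz)


for name, states in BUS_STATES.items():
    print(name, bus_op_ns(states), "ns")
print("read with one wait pair", bus_op_ns(BUS_STATES["read"], 1), "ns")
```

At 8 MHz a read without wait states therefore occupies 500 ns, and each
wait state pair adds a further 125 ns.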

Most MC68000 instructions are memory fetch time bound. In addition, CPU
and memory operations are carried out in parallel. Therefore, the
processor is almost always on the bus doing a read or a write
operation. In some addressing modes, requiring a full 32 bit add, the
bus is idle for one basic machine cycle (i.e., 250 ns at 8 MHz). Some
instructions, however, such as RESET, MULTIPLY, DIVIDE, SHIFT, and
ROTATE are CPU intensive. These instructions, which are less frequently
encountered, do not require any external accesses during execution.
Overall, though, it is reasonable to assume that a continuous stream of
reads and writes will be performed by each processor under execution.
The bus cycle description implies that a processor uses the bus for all
but one of the states in each of the basic bus operations. If one were
simply to connect two processors on a single time-shared bus, they
could overlap bus dependent operations (execution) only during that
single state. Thus, with respect to overall bus utilization, there is
no major difference between the old Motorola M6800 and the new MC68000.
Bus utilization values like those shown for the 16 bit processors
(LSI-11 and TI9900) in Table 1 should be expected for the MC68000.

From  the  above  description  it  can  be 
seen  that  each  processor  inherently  uses 
the  bus  it  is  connected  to  for  a  very  large 
portion  of  the  time.  Any  attempts  to 
connect  more  than  one  processor  on  one  such 
bus  would  not  yield  a  significant  increase 
in  computation  power.  This  is  because  each 
processor  would  have  to  be  proportionately 
slowed  down  in  order  to  get  access  to  the 
bus.  Obviously,  the  above  conclusions  are 
highly  dependent  on  the  instruction  mix. 
However,  the  overall  result  is  that,  at 
best,  2  processors  would  profit  on  a  single 
bus.    That    is    far    from   a   promising    result. 


local bus contains the needed information is stopped and the data is
passed back to the requester, who then relinquishes the global bus.

Figure 5
Multi-MC68000 system using Bus Arbitration Modules (BAMs).

3.2  Vendor Supplied Multiprocessing

The  microprocessor  manufacturers  are 
aware  of  the  high  bus  utilization  problems, 
and  they  have  devised  plans  to  allow  many 
of  their  CPUs  to  be  connected  together  in  a 
MMPS.  The  designs  differ  widely  from  vendor 
to  vendor.  Again,  the  actual  analysis 
depends,  to  a  large  extent,  on  the  bus 
usage  of  any  single  processor.  In  addition, 
it  also  depends  on  the  amount  of  crosstalk 
traffic  between  processors  and  various 
memory  and  I/O  elements,  i.e.,  the  total 
amount  of  data  to  be  transmitted  over  a 
common  communication  system. 

In the Motorola interconnection system, each processor is set up with a
local bus with local memory and peripherals. In addition, there is a
Global bus connecting all the local buses together through Bus
Arbitration Modules (BAMs). The system is illustrated in Figure 5. Each
processor is free to execute at full speed within its own local space
until either it needs something in another local bus area, or some
local information is requested from the BAM unit. Any access off the
local bus onto the Global bus takes more time than a simple local
access. Accesses from the Global bus back onto a local bus are done
using a DMA operation. There are no strictly global, shared resources
in the Motorola definition. The BAMs operate as memory manager devices.
Each recognizes off-local accesses, and requests the global bus to get
access to that resource, which is assumed to be on some other local
bus. When granted that bus, it makes the request, and the processor
system whose


It seems evident that the BAM solution is designed for low non-local
access rates. The time required to acquire the necessary communication
path (local-global-local) and transfer the data is relatively long.
Furthermore, the global bus can only have one transaction going on at a
time, ignoring all other requests until the bus is again free. Overall,
this process is relatively simple, since there is no pending of
operations or other dynamic considerations. The bus switches supply a
complete communication connection and hold it as long as the entire
transaction requires. A major difficulty with this scheme is that there
is nothing to prevent several processors from making continuous
accesses into one processor, effectively stopping that processor
entirely. The priority on the global bus appears to be fixed, so the
processor with the least priority may never get a global transaction
completed.
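
The starvation risk of a fixed-priority global bus can be demonstrated
with a small simulation. This is a sketch of a generic fixed-priority
arbiter, not of the actual BAM logic: it assumes one grant per bus
cycle, at most one outstanding request per processor, and hypothetical
request periods.

```python
def simulate_fixed_priority(request_periods, cycles):
    """Return the number of grants per processor.

    request_periods[i]: processor i raises a request every that many
    cycles (if it has none outstanding). Index 0 has highest priority.
    """
    pending = [False] * len(request_periods)
    grants = [0] * len(request_periods)
    for cycle in range(cycles):
        for i, period in enumerate(request_periods):
            if cycle % period == 0:
                pending[i] = True
        # Fixed priority: always grant the lowest-numbered requester.
        for i, waiting in enumerate(pending):
            if waiting:
                pending[i] = False
                grants[i] += 1
                break
    return grants


print(simulate_fixed_priority([4, 4, 4], 1000))  # light load
print(simulate_fixed_priority([2, 2, 4], 1000))  # two busy high-priority units
```

In the second case the two high-priority processors together consume
every bus cycle, and the third processor is never granted the bus.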
An effective speedup could be achieved using global memory attached to
a dedicated BAM type unit, with no other processor local to it. In this
case, nothing else would be contending for the local bus, and any
request from the Global bus could be granted immediately without
arbitration. As soon as a processor got the Global bus, it could be
routed directly to global memory, and the data could be returned fairly
quickly. Global memory would then be truly shared, but access to it
would be limited in rate by the global bus bandwidth, which, by
definition, cannot be any higher than the data rate on a single local
processor bus. Thus, a processor executing a common routine in global
memory could tie it up completely. As the processors are assigned a
fixed priority, a complete lockout of the other requests would occur.


3.3  Split Transaction Multiprocessing

In the previous MC68000 designs, buses were used to supply complete
communication links between one master and one slave device; the master
would ask the slave to do something, and then wait until the slave
responded. To allow more extensive sharing of resources, a higher speed
communication mechanism is needed between devices. In this regard, it
becomes desirable to connect several MC68000 processors to a split
(especially Pended) transaction bus. One very important multiprocessing
point must be brought out here. In a multiprocessing system, processor
synchronization requires some form of an "atomic" test-and-set
primitive to implement resource allocation. In the MC68000, this is
performed via a read-modify-write operation, which is formed of
connected read and write operations. In a split transaction bus, the
first read cycle can be started, and other bus operations can occur
while the first read is completing. If one of those operations happens
to address the location being read by the test-and-set, then, unless
careful hardware measures are taken, it can reach that location between
the test-and-set cycles and destroy the intended operation of the
test-and-set. This problem is particularly evident in a Pended bus with
queues, where the queue may buffer up many requests to many locations.
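
The hazard can be shown with a small replay of bus operations against
one shared lock word. This is a minimal sketch, not a model of the
MC68000 bus: it assumes a lock value of 0 means "free", that the write
half of the test-and-set always stores 1, and that a processor believes
it owns the resource if the value it read was 0. The function name and
the processor labels are hypothetical.

```python
def owners_after(schedule):
    """Replay (processor, operation) steps on one lock word and return
    the processors that believe they acquired the lock."""
    lock = 0
    value_read = {}
    for processor, operation in schedule:
        if operation == "read":
            value_read[processor] = lock
        elif operation == "write":
            lock = 1
        else:
            raise ValueError(f"unknown operation: {operation}")
    return sorted(p for p, v in value_read.items() if v == 0)


# Read and write halves kept together on the bus (indivisible).
indivisible = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# Split transaction bus: B's read slips in between A's read and write.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(owners_after(indivisible))   # only A acquires the lock
print(owners_after(interleaved))   # both A and B believe they own it
```

When the two halves are interleaved, both processors read the lock as
free, which is exactly the failure of mutual exclusion described above.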

Virtually any processor can be connected to a split transaction bus
(possibly through some external control logic), provided it creates
some form of request signal (such as an address strobe), and has an
asynchronous signal to denote transaction completion (wait or
acknowledge). On processors where the split transaction protocol must
be created from such external signals, the timings may degrade
processor-to-memory performance figures. In the MC68000, the protocol
can be created, but the timing is less than optimal. If the slave
memories are fast enough to supply the processor without wait states
when connected directly, then a split transaction system will have to
insert two to four wait states [14] into the processor read cycles due
to the delays associated with proper address strobing and data
acknowledge. The addition of the wait state pairs results in a net
slow-down of up to 50 percent for each cycle of each processor. By far
the worst difficulty, however, is that the MC68000 test-and-set
operation simply makes a read operation followed by a write operation.
It is impossible to tell, until the read is done, that a
read-modify-write operation is in progress, which is too late to stop
interference [14]. This problem can only be solved by a change in the
protocol microcode to enable an output signal to give an earlier
indicator. The only other choice is to ignore the test-and-set software
instruction and to create the indivisible operation in hardware
test-and-self-set flags.
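
The wait-state penalty quoted above can be checked against the read
cycle length given in Section 3.1. The sketch assumes that the slowdown
is simply the ratio of inserted wait states to the 8 states of an
undisturbed read cycle; the function name is chosen for illustration.

```python
READ_STATES = 8  # states in a read cycle without wait states (Section 3.1)


def cycle_slowdown(base_states: int, wait_states: int) -> float:
    """Fractional lengthening of a bus cycle caused by wait states."""
    return wait_states / base_states


for wait_states in (2, 4):
    percent = 100.0 * cycle_slowdown(READ_STATES, wait_states)
    print(wait_states, "wait states:", percent, "percent longer")
```

Two wait states lengthen the read cycle by 25 percent and four by
50 percent, which agrees with the upper bound stated in the text.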
[...] interconnection problems have been discussed in terms of the
single bus. The performance drawbacks associated with the single bus
have been attributed to the high bus utilization at the basic building
block level. The Pended Transaction Bus protocol has been presented as
a general solution to the bus utilization problem. It has been
indicated that a properly designed time-shared bus can support more
[...] processors without any severe [...]


The  basic  bus  control  protocol  of  the 
MC68000  has  been  investigated.  It  has  been 
shown  to  effectively  curtail  significant 
multi-microprocessing  because  of  high  bus 
utilization.  Although  the  low  bus  bandwidth 
is  highly  dependent  on  the  instruction  mix, 
severe  throughput  degradation  results 
because  of  an  inefficient  protocol  design 
at    the    processor    level. 

The vendor-supplied multiple microprocessor interconnection system has
also been reviewed. A consensus view is that the global system bus can
be easily saturated by the requests of a single high-priority
processor. The proposed Motorola design is best aimed at low-volume
global transactions where each processor relies heavily on its local
resources. The Local/Global bus solution would be inappropriate and
inefficient for a large (>5) number of processing elements with high
bus bandwidth requirements.

A pended transaction protocol is one approach for increasing the
available bandwidth of such a time-shared bus. However, the process of
adapting a standard, high bus usage microprocessor to such a Pended
multiprocessor scheme still has several difficulties which give such an
approach limited practicality. These problems include additional wait
states in the processor cycle, and the inability to guarantee proper
operation during a read-modify-write sequence. Since the
microprogrammed nature of the MC68000 allows modification of the
addressing control structures, it would seem possible to properly
implement the Pended Bus Protocol directly from the chip without
radically altering the chip's internal structure. Practical
multiprocessing capabilities could then be implemented easily and
directly.


The implications of such an improved system interconnection
architecture for high bandwidth applications, such as real-time
industrial control or computation-intensive tasks, are truly
significant. Computer control functions such as multi-dimensional,
multi-arm machine tool operations can now be accomplished in a modular,
expandable fashion that can still accommodate current technology
building blocks. Such techniques are also extendable to other
application areas.

[13] Toong, Hoo-min D., and Gupta, Amar, "IMMPS-Interactive
     Multi-Microprocessor Performance System," MIT CISR Technical
     Report #6, December 1979.
[14] Toong, Hoo-min D., Strommen, S., and Goodrich II, E. et al.,
     "Architectural Comparisons: MC68000, Z8000, 8086," MIT CISR
     Technical Report #5, December 1979.
[15] Toong, Hoo-min D., work in progress, to be published.